Summary

The following provides a description of single-cell RNA-seq analysis of motorneurons collected and sequenced from Crotalus atrox in 2018.

In this analysis, I examined expression at the *gene level* between locomotor and spinal cord cells to determine genes that were differentially expressed, using the tiger rattlesnake as a reference. We see 26 differentially expressed genes, which includes some interesting candidates, but no potassium channels, somewhat consistent in magnitude with an earlier analysis.

A final plot compares genes elected in another analysis for qPCR, showing that the ss-rnaseq data is consistent with those results.

Methods

Cell Collection

<BORIS NEEDS TO ADD METHODS DETAILS HERE>

Library preparation

Boris delivered 2 plates with 54 locomotor samples and 78 shaker samples to Alexander Janjic in May 2018 see (Boris_June2018.pdf). Elsewhere, the laboratory notes say that there were 54 locomotor cells and 62 shaker cells. In the notes, it says During centrifugation, 3 locomotor and 16 shaker samples were lost due to an accident. These numbers don’t add up.

Most plausibly, it looks like there should be 51 locomotor cells and 62 shaker cells in in the experiment.

Library was prepared using the mcSCRB-seq protocol v 1.1 (JWB Bagnoli et al. 2018).

i7 index N714 was used for locomotor cell plate, i7 index N715 was used for shaker cell plate.

Barcode sets per cell were unknown, requiring us to digitally demultiplex the cells…

Index Name Nextera Phenotype
GCTCATGA Boris1 N714 Locomotor
ATCTCAGG Boris2 N715 Shaker

Other laboratory specific notes are in the file Boris_June2018.pdf set from Alexander Janjic in June 2018.

Sequencing

The samples were then pooled and run on 0.5 lanes of XX instrument. (NEED MORE DETAILS FROM ALEXSANDAR, if he remembers…!)

Analysis

Raw single cell data was processed using the zUMIs pipeline (Parekh et al. 2018), which filters reads with low-quality barcodes and UMIs, and then uses the STAR aligner v2.7.3a (Dobin et al. 2012) to align reads to the Croatuls tigris genome (NCBI RefSeq GCF_016545835.1) (Margres et al. 2021). zUMIs then predicts cell barcodes and collapses UMIs to create a read count table for downstream analysis.

Read count tables were then analyzed using the (Hao et al. 2021) package to determine expression differences between cell pools.

All code for data analysis can be found at (https://github.com/msuefishlab/INSERT_REPO_NAME)

Results

Import ZUMIs Summary Data

rps<-read.table(file.path(root,"output_data/single_snakes/zUMIs_output/stats/single_snakes.readspercell.txt"),header = T)
tc<-data.frame(rps)
tc$type<-factor(tc$type)
tc$plate<-as.factor(substr(tc$RG,7,14))
tc$cell<-as.factor(substr(tc$RG,0,6))
levels(tc$plate)[1] <- "Bad"
levels(tc$plate)[2] <- "ATCTCAGG"
levels(tc$plate)<-c("Bad","Shaker","Locomotor")

featColors<-c("#1A5084", "#914614" ,"#118730","grey33","tan1","#631879FF","gold1","grey73","firebrick3")
names(featColors)<-c("Exon","Intron+Exon","Intron","Unmapped","Ambiguity","MultiMapping","Intergenic","Unused BC","User")


tc %>% 
  filter(plate != "Bad") %>%
  filter(type != "Unmapped") %>%
  ggplot(aes(y=N,fill=plate))+
        geom_density(alpha=0.7) +
        xlab("") +
        ylab("log10(Number of reads)") +
        scale_y_log10()+ 
        coord_flip()+
        scale_fill_manual(values=mycolors)+
        geom_hline(yintercept = 1e04)

Based on filtering out cells with < 10,000 reads, we get a number that looks similar to the number of input cells, with 10 more locomotor cells then we are supposed to have… let’s dig into this a bit more.

tc %>% filter(plate != "Bad" & N > 1e+04 & type != "Unmapped") %>%group_by(cell,plate) %>% summarise(Total=sum(N)) %>% group_by(plate) %>% summarise(n=n()) %>% kbl() %>% kable_styling()
plate n
Shaker 63
Locomotor 60

Import ZUMIs and Convert to Seurat Object

zumis_output <- readRDS(file.path(root,"output_data/single_snakes/zUMIs_output/expression/single_snakes.dgecounts.rds"))
umis <- as.matrix(zumis_output$umicount$inex$all)
seu <- Seurat::CreateSeuratObject(counts = umis)
seu@meta.data$celltype<-as.factor(substr(row.names(seu@meta.data),7,14))
levels(seu@meta.data$celltype)<-c("shaker","shaker","locomotor")

QC

Right out of the gate, zUMIs reduces the 192 possible barcodes from our file to to 162 potential cells (equally distributed between the two phenotypes). We know that there were less cells than this, suggesting that some of these cells represent “noise”…

seu@meta.data %>% group_by(celltype) %>% summarise(n=n()) %>% kbl() %>%
  kable_styling()
celltype n
shaker 81
locomotor 82
Identify Ribosomal Genes

We can see right away that most cells are ~ 20% ribosomal RNA, but locomotor cells have a long tail, suggesting some “bad” cells with low ribosomal RNA in them…

#grep("^RP[LS]",rownames(seu@assays$RNA@counts),value = TRUE)
PercentageFeatureSet(seu,pattern="^RP[LS]|LOC120298527|LOC120298533") -> seu$percent.Ribosomal

seu@meta.data %>%
    ggplot(aes(x=percent.Ribosomal, color = celltype, fill=celltype)) +
    geom_density(alpha = 0.7) +
    theme_classic()+
    scale_fill_manual(values=mycolors)+
    geom_vline(xintercept=12)

Calculate Percent Largest Gene
apply(
  seu@assays$RNA@counts,
  2,
  max
) -> seu$largest_count

apply(
  seu@assays$RNA@counts,
  2,
  which.max
) -> seu$largest_index

rownames(seu)[seu$largest_index] -> seu$largest_gene

100 * seu$largest_count / seu$nCount_RNA -> seu$percent.Largest.Gene
Calculate Read Counts

Again, we can see a pretty sharp bimodal distribution with a large number of cells < 10,000 reads and another large group > 10,000 reads…

seu@meta.data %>%
    ggplot(aes(x=nCount_RNA, color = celltype, fill=celltype)) +
    geom_density(alpha = 0.7) +
    theme_classic()+
    scale_x_log10()+
    scale_fill_manual(values=mycolors)+
    geom_vline(xintercept=1.4e04)

Let’s look at the number of unique genes found in each sample. Here we can see that shaker cells have a strange distribution: one population has about 1000 unique features which is not present in locomotor cells at the same frequency. In the majority of cells we detect approximately 7500 unique features (genes)…

seu@meta.data %>%
    ggplot(aes(x=nFeature_RNA, color = celltype, fill=celltype)) +
    geom_density(alpha = 0.7) +
    theme_classic()+
    scale_x_log10()+
    scale_fill_manual(values=mycolors)+
    geom_vline(xintercept=3000)

Filtering

Based on these values, we have a reasonable set of filters to propose:

filtered_seu <- subset(
  seu,
    nCount_RNA>1.4e04 &
    nFeature_RNA > 3000 &
    percent.Ribosomal > 12)

filtered_seu@meta.data %>% group_by(celltype) %>% summarise(n=n()) %>% kbl() %>%
  kable_styling()
celltype n
shaker 45
locomotor 45

This gives us a total of 90 cells to examine, let’s take a look at the distributions of cells a bit more closely (turning off log scales now…)

filtered_seu@meta.data %>%
    ggplot(aes(x=nCount_RNA, color = celltype, fill=celltype)) +
    geom_density(alpha = 0.7) +
    theme_classic()+
    scale_fill_manual(values=mycolors)+
    geom_vline(xintercept=1.4e04)+
    geom_vline(xintercept=1.25e05)

What I see here is a bit of “ringing” in both samples with very high read counts, which I think are referred to as “doublets” in ss-RNAseq land. Let’s get rid of those now too!

filtered_seu <- subset(
  seu,
    nCount_RNA>1.4e04 &
    nFeature_RNA > 3000 &
    percent.Ribosomal > 12 &
    nCount_RNA<1.25e05 )

filtered_seu@meta.data %>% group_by(celltype) %>% summarise(n=n()) %>% kbl() %>%
  kable_styling()
celltype n
shaker 41
locomotor 34

This gives us a grand total of 75 cells. Let’s have a look at the distributions of our QC data in this filtered dataset.

Idents(filtered_seu) <- 'celltype'
qc_plots<-VlnPlot(filtered_seu,features=c("nFeature_RNA", "nCount_RNA","percent.Ribosomal"),alpha(0.5),combine=FALSE)
for(i in 1:length(qc_plots)) {
  qc_plots[[i]] <- qc_plots[[i]] + scale_fill_manual(values=mycolors)+ theme(legend.position = 'none')
}
CombinePlots(qc_plots)

The distributions of our QC data seem to mostly overlap and make good biological sense. Let’s proceeed!

Clustering

What are the most variable genes in our dataset, potentially prediciting what cell type cluster they belong to?

filtered_seu <- NormalizeData(filtered_seu, normalization.method = "RC", scale.factor = 1e6)
filtered_seu <- FindVariableFeatures(filtered_seu, selection.method = "vst", nfeatures = 2000)
top10 <- head(VariableFeatures(filtered_seu), 10)
plot1 <- VariableFeaturePlot(filtered_seu)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot2

ScaleData(filtered_seu,features=rownames(filtered_seu)) -> filtered_seu
RunPCA(filtered_seu,features=VariableFeatures(filtered_seu)) -> filtered_seu
DimPlot(filtered_seu,reduction="pca",group.by="celltype") + scale_color_manual(values=mycolors)

I’m not seeing a super clean separation between shaker and locomotor cells, but there is sort of a trend here…

Differential Expression

Idents(filtered_seu) <- 'celltype'
markers<-FindMarkers(filtered_seu,ident.1 = "shaker", min.pct = 0.25)
sig_genes<-subset(markers, p_val_adj<0.1)
sig_genes$id<-rownames(sig_genes)
sig_genes<-as.data.frame(sig_genes)


sig_list<-merge(sig_genes,subset(gtf,type=="transcript") %>% group_by(gene) %>% filter(!duplicated(gene)) %>% select(gene,product),by.x="id",by.y="gene",all.x=TRUE)

#sig_list[,c(1,7,2,3,4,5,6)] %>% arrange(p_val_adj) %>% kbl() %>% kable_styling()

I’ve been able to identify 26 DE genes between shaker and locomotor cells. Of particular interest might be the following: NALCN (sodium leak channel) and splicing factor SWAP. Here are expression plots for all 26 genes.

# 1. Extract normalized expression values for significant genes
sig_gene_names <- rownames(sig_genes)
expression_values_norm <- filtered_seu[["RNA"]]@data[sig_gene_names, ]

# 2. Convert to a dense matrix and then to a data frame
expression_values_norm_matrix <- as.matrix(expression_values_norm)
expression_values_df <- as.data.frame(expression_values_norm_matrix)

# 3. Transpose the data frame so that cell barcodes become row names
expression_values_df_transposed <- as.data.frame(t(expression_values_df))

# 4. Add cell type information
cell_types <- filtered_seu$celltype
expression_values_df_transposed$celltype <- cell_types[rownames(expression_values_df_transposed)]

# 5. Reshape to long format
expression_values_long <- gather(expression_values_df_transposed, key = "gene", value = "expression", -celltype)

# 6. Arrange by feature and group by cell type, then calculate mean expression
expression_values_grouped <- expression_values_long %>%
  group_by(celltype, gene) %>%
  summarise(mean_expression = mean(expression, na.rm = TRUE))
## `summarise()` has grouped output by 'celltype'. You can override using the
## `.groups` argument.
# 7. Spread the data to wide format
expression_values_wide <- expression_values_grouped %>%
  spread(key = celltype, value = mean_expression)
  
final_table <- merge(expression_values_wide, sig_list, by.x = "gene", by.y = "id", all.x = TRUE)

final_table[,c(1,9,2,3,5,8)] %>% arrange(p_val_adj) %>% kbl() %>% kable_styling()
gene product shaker locomotor avg_log2FC p_val_adj
LOC120298205 uncharacterized LOC120298205 0.4533081 138.80952 -6.587978 0.0000000
PDE7A phosphodiesterase 7A, transcript variant X1 9.2299517 249.35524 -4.613105 0.0000000
LPAR6 lysophosphatidic acid receptor 6 3.6235663 62.65244 -3.783138 0.0000001
LOC120317132 galectin-6-like 1.6274441 54.09585 -4.390212 0.0000010
LOC120318694 uncharacterized LOC120318694, transcript variant X1 148.6662310 13.19015 3.398786 0.0000022
LOC120319595 vomeronasal type-2 receptor 26-like 1.9564457 67.20537 -4.527950 0.0000048
LOC120303167 splicing factor SWAP, transcript variant X1 326.5349748 73.06768 2.144733 0.0000089
EPB41L3 erythrocyte membrane protein band 4.1 like 3, transcript variant X5 403.9814590 810.76113 -1.003199 0.0001171
TUBB tubulin beta class I 217.5898320 573.89309 -1.395067 0.0001508
AK5 adenylate kinase 5, transcript variant X1 16.2693429 116.31100 -2.764053 0.0001856
ABCC9 ATP binding cassette subfamily C member 9, transcript variant X2 420.2175668 152.70942 1.454360 0.0002655
LOC120319596 vomeronasal type-2 receptor 26-like 130.3126594 340.53937 -1.379046 0.0015383
IKZF1 IKAROS family zinc finger 1, transcript variant X2 11.2096698 57.97901 -2.272177 0.0043207
PTPRM protein tyrosine phosphatase receptor type M, transcript variant X1 152.2070796 355.55021 -1.218622 0.0049298
LOC120307612 tubulin beta-2 chain 173.2001020 409.99067 -1.238360 0.0069229
SDC2 syndecan 2 29.1483760 136.82108 -2.192644 0.0085474
SEZ6 seizure related 6 homolog, transcript variant X3 148.0487694 53.44369 1.452948 0.0120074
NALCN sodium leak channel, non-selective, transcript variant X3 28.1239143 124.57693 -2.108295 0.0263756
HOXD9 homeobox D9 129.1150872 28.56036 2.138052 0.0265779
LZTR1 leucine zipper like transcription regulator 1, transcript variant X2 24.2292653 73.63468 -1.564748 0.0509420
CREB5 cAMP responsive element binding protein 5, transcript variant X7 51.6395444 172.70232 -1.722398 0.0538470
TRNAA-AGC-9 NA 0.0000000 22.16146 -4.533654 0.0603690
TPPP3 tubulin polymerization promoting protein family member 3, transcript variant X1 120.8431947 290.84059 -1.260155 0.0630456
LOC120301339 5S ribosomal RNA 6.4100732 75.64016 -3.370541 0.0792824
TUBB2A tubulin beta 2A class IIa 50.5104539 141.03587 -1.463318 0.0825205
CNTROB centrobin, centriole duplication and spindle assembly protein, transcript variant X4 47.6996137 10.58252 2.071961 0.0839091
myfeatures<-sig_list$id
plots <- VlnPlot(filtered_seu,group.by="celltype", split.by = "celltype",features=myfeatures,combine=FALSE) 

for(i in 1:length(plots)) {
  plots[[i]] <- plots[[i]] + geom_boxplot(alpha=0.5) + theme(legend.position = 'none') + scale_fill_manual(values=mycolors) 
}
CombinePlots(plots)

Here are the genes examined in our qPCR assay:

goi<-c("HOXD9","KCNQ3","KCNQ2","LOC120319564")

plots <- VlnPlot(filtered_seu,group.by="celltype", split.by = "celltype",features=goi,combine=FALSE) 

for(i in 1:length(plots)) {
  plots[[i]] <- plots[[i]] + geom_boxplot(alpha=0.2) + theme(legend.position = 'none') + scale_fill_manual(values=mycolors) 
}
CombinePlots(plots)

Finally, CHAT appears to be a good marker for motorneurons, it’s highly expressed and seems to be uniformly abundant in both samples…

goi_motorneuron_markers<-c("CHAT")

plots <- VlnPlot(filtered_seu,group.by="celltype", split.by = "celltype",features=goi_motorneuron_markers,combine=FALSE) 

for(i in 1:length(plots)) {
  plots[[i]] <- plots[[i]] + geom_boxplot(alpha=0.2) + theme(legend.position = 'none') + scale_fill_manual(values=mycolors) 
}
CombinePlots(plots)

Here are the additional genes requested by Boris:

goi<-c("NALCN")

plots <- VlnPlot(filtered_seu,group.by="celltype", split.by = "celltype",features=goi,combine=FALSE) 

for(i in 1:length(plots)) {
  plots[[i]] <- plots[[i]] + geom_boxplot(alpha=0.2) + theme(legend.position = 'none') + scale_fill_manual(values=mycolors) 
}
CombinePlots(plots)

## KV 7.2 and KV 7.3

goi_motorneuron_markers<-c("KCNQ3")

plots <- VlnPlot(filtered_seu,group.by="celltype", split.by = "celltype",features=goi_motorneuron_markers,combine=FALSE) 

for(i in 1:length(plots)) {
  plots[[i]] <- plots[[i]] + geom_boxplot(alpha=0.2) + theme(legend.position = 'none') + scale_fill_manual(values=mycolors) 
}
CombinePlots(plots)

goi_motorneuron_markers<-c("KCNQ2")

plots <- VlnPlot(filtered_seu,group.by="celltype", split.by = "celltype",features=goi_motorneuron_markers,combine=FALSE) 

for(i in 1:length(plots)) {
  plots[[i]] <- plots[[i]] + geom_boxplot(alpha=0.2) + theme(legend.position = 'none') + scale_fill_manual(values=mycolors) 
}
CombinePlots(plots)

New Heatmap Analaysis

kv_channels<-c("KCNH8","KCNH3","KCNH2","KCNH6","KCNH1","KCNH5","KCNQ3","KCNQ5","KCNQ4","KCNQ1","KCNQ2","KCNS3","KCNS1","KCNS2","KCNV1","KCNG1","KCNG4","KCNF1","KCNV2","KCNV3","KCND1","KCND2","KCND3","KCNB1","KCNB2","KCNC2","KCNC1","KCNC4","KCNC3","KCNA6","KCNA1","KCNA3","KCNA2","KCNA7","KCNA4","KCNA10","KCNA5")

kv_names<-c("Kv12.1","Kv12.2","Kv11.3","Kv11.1","Kv10.1","Kv10.2","Kv7.3","Kv7.5","Kv7.4","Kv7.1","Kv7.2","Kv9.3","Kv9.1","Kv9.2","Kv8.1","Kv6.1","Kv6.4","Kv5.1","Kv8.2","Kv6.3","Kv4.1","Kv4.2","Kv4.3","Kv2.1","Kv2.2","Kv3.2","Kv3.1","Kv3.4","Kv3.3","Kv1.6","Kv1.1","Kv1.3","Kv1.2","Kv1.7","Kv1.4","Kv1.8","Kv1.5")

kv_table <- data.frame(
  Names = kv_names,
  row.names = kv_channels
)

kv_filtered_seu<-subset(x=filtered_seu, features=kv_channels)

label_mapping <- setNames(kv_table$Names, rownames(kv_table))


#DoHeatmap(object=kv_filtered_seu,features=kv_channels,slot="counts")+  scale_fill_gradientn(colors=c("white", "yellow", "red"))+ scale_y_discrete(labels = label_mapping)

#VlnPlot(kv_filtered_seu,features=kv_channels, stack= TRUE, fill.by = "ident", flip=TRUE, sort="decreasing") +coord_flip() +scale_y_discrete(labels = label_mapping)

# Step 1: Calculate average expression values for each feature across all cell types
avg_expr <- AverageExpression(kv_filtered_seu, features = kv_channels, group.by = "celltype")
avg_expr <- avg_expr$RNA  # Replace 'RNA' with the appropriate assay name if different

# Calculate the average expression for each feature across all cell types
avg_expr_feature <- rowMeans(avg_expr, na.rm = TRUE)

# Step 2: Sort features by their average expression
sorted_features <- sort(avg_expr_feature, decreasing = TRUE)

# Step 3: Generate and sort plots
plots <- VlnPlot(kv_filtered_seu, group.by="celltype", split.by = "celltype", features=names(sorted_features), flip=TRUE, combine=FALSE, sort=TRUE)

# Number of columns in the combined plot
n_col <- 2

# Number of rows needed, assuming you fill column-wise
n_row <- ceiling(length(plots) / n_col)

# Generate and sort plots
plots <- VlnPlot(kv_filtered_seu, group.by="celltype", split.by = "celltype", features=names(sorted_features), flip=TRUE, combine=FALSE, sort=TRUE)

for(i in 1:length(plots)) {
  # Identify the last plot in each column
  last_in_columns <- seq(from = n_col * (n_row - 1) + 1, to = n_col * n_row, by = 1)
  
  # Extract the feature name from the plot title
  feature_name <- plots[[i]]$labels$title
  
  # Look up the corresponding protein name
  protein_name <- kv_table[feature_name, "Names"]
  
  # Remove x-axis and y-axis labels for all but the last plot in each column
  if (i %in% last_in_columns || i == length(plots)) {
    # Keep x-axis labels, remove y-axis labels and line
    plots[[i]] <- plots[[i]] + 
      ggtitle(paste0(protein_name,"(",feature_name,")")) +  # Replace title
      scale_fill_manual(values=mycolors) + 
      coord_flip() +
      ylim(0, 600) +
      scale_x_discrete(expand = c(0, 0)) +  # Start x-axis at 0
      theme(axis.line.y=element_blank(),  # Remove y-axis line
            axis.title.y=element_blank(),
            axis.text.y=element_blank(),
            axis.ticks.y=element_blank(),
            plot.title.position = "plot",
            plot.title = element_text(hjust = 0),
            legend.position="none")
  } else {
    # Remove both x-axis and y-axis labels and line
    plots[[i]] <- plots[[i]] + 
      ggtitle(paste0(protein_name,"(",feature_name,")")) +  # Replace title
      scale_fill_manual(values=mycolors) + 
      coord_flip() +
      ylim(0, 500) +
      scale_x_discrete(expand = c(0, 0)) +  # Start x-axis at 0
      theme(axis.line.y=element_blank(),  # Remove y-axis line
            axis.title.y=element_blank(),
            axis.text.y=element_blank(),
            axis.ticks.y=element_blank(),
            axis.title.x=element_blank(),
            axis.text.x=element_blank(),
            plot.title.position = "plot",
            plot.title = element_text(hjust = 0),
            axis.ticks.x=element_blank(),
            legend.position="none")
  }
}

# Combine the sorted plots
CombinePlots(plots, n_col)

References

Dobin, Alexander, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras. 2012. “STAR: Ultrafast Universal RNA-Seq Aligner.” Bioinformatics 29 (1): 15–21. https://doi.org/10.1093/bioinformatics/bts635.
Hao, Yuhan, Stephanie Hao, Erica Andersen-Nissen, William M. Mauck III, Shiwei Zheng, Andrew Butler, Maddie J. Lee, et al. 2021. “Integrated Analysis of Multimodal Single-Cell Data.” https://doi.org/10.1016/j.cell.2021.04.048.
JWB Bagnoli, Johannes, Christoph Ziegenhain, Aleksandar Janjic, Lucas Esteban Wange, Beate Vieth, Swati Parekh, Johanna Geuder, Ines Hellmann, and Wolfgang Enard. 2018. “mcSCRB-Seq Protocol V1.” http://dx.doi.org/10.17504/protocols.io.nrkdd4w.
Margres, Mark J., Rhett M. Rautsaw, Jason L. Strickland, Andrew J. Mason, Tristan D. Schramer, Erich P. Hofmann, Erin Stiers, et al. 2021. “The Tiger Rattlesnake Genome Reveals a Complex Genotype Underlying a Simple Venom Phenotype.” Proceedings of the National Academy of Sciences 118 (4). https://doi.org/10.1073/pnas.2014634118.
Parekh, Swati, Christoph Ziegenhain, Beate Vieth, Wolfgang Enard, and Ines Hellmann. 2018. “zUMIs - A Fast and Flexible Pipeline to Process RNA Sequencing Data with UMIs.” GigaScience 7 (6). https://doi.org/10.1093/gigascience/giy059.

  1. Michigan State University Department of Integrative Biology, ↩︎